Intro

Juan Nunez-Iglesias
Victorian Life Sciences Computation Initiative (VLSCI)
University of Melbourne

Quick example: gene expression, without numpy

	Cell type A	Cell type B	Cell type C	Cell type D
Gene 0	100	200	50	400
Gene 1	50	0	0	100
Gene 2	350	100	50	200



In [ ]:

    
gene0 = [100, 200, 50, 400]
gene1 = [50, 0, 0, 100]
gene2 = [350, 100, 50, 200]
expression_data = [gene0, gene1, gene2]

Why is this a bad idea?

Now with NumPy



In [ ]:

    
import numpy as np
a = np.array(expression_data)
print(a)

We are going to:

Obtain an RPKM expression matrix
Quantile normalize the data

using the awesome power of NumPy

Inside a numpy ndarray



In [ ]:

    
def print_info(a):
    print('number of elements:', a.size)
    print('number of dimensions:', a.ndim)
    print('shape:', a.shape)
    print('data type:', a.dtype)
    print('strides:', a.strides)
    print('flags:')
    print(a.flags)
    
print_info(a)



In [ ]:

    
print(a.data)



In [ ]:

    
abytes = a.ravel().view(dtype=np.uint8)



In [ ]:

    
print_info(abytes)



In [ ]:

    
print(abytes[:24])

Example: take the transpose of `a`



In [ ]:

    
print_info(a)



In [ ]:

    
print_info(a.T)

Example: skipping rows and columns with slicing



In [ ]:

    
print_info(a.T)



In [ ]:

    
print_info(a.T[::2])



In [ ]:

    
print_info(a.T[::2, ::2])

Getting a copy



In [ ]:

    
b = a



In [ ]:

    
print(b)



In [ ]:

    
a[0, 0] = 5
print(b)
a[0, 0] = 100

Advanced operations: axis-wise evaluation



In [ ]:

    
expr = np.load('expr.npy')



In [ ]:

    
print_info(expr)

This has the raw read count data. However, each sample gets a different number of reads, so we want to normalize by the library size, which is the total number of reads across a column.

The np.sum function returns the sum of all the elements of an array. With the axis argument, you can take the sum along the given axis.



In [ ]:

    
lib_size = np.sum(expr, axis=0)

Exercise

Generate a 10 x 3 array of random numbers. From each row, pick the number closest to 0.75. Make use of np.abs and np.argmax to find the column j which contains the closest element in each row.



In [ ]:

Advanced operations: broadcasting

In order to normalize every column by its corresponding library size, we have to align the two arrays' axes: each dimension must be either the same size, or one of the arrays must have size 1. Use np.newaxis to match the dimensions.



In [ ]:

    
print(expr.shape)
print(lib_size.shape)
print(lib_size[np.newaxis, :].shape)

However, NumPy will automatically prepend singleton dimensions until the array shapes match or there is an error:



In [ ]:

    
np.all(expr / lib_size ==
       expr / lib_size[np.newaxis, :])



In [ ]:

    
expr_lib = expr / lib_size

We also multiply by $10^6$ in order to keep the numbers on a readable scale (reads per million reads).



In [ ]:

    
expr_lib *= 1e6

Finally, longer genes are more likely to produce reads. So we normalize by the gene length (in kb) to produce a measure of expression called Reads Per Kilobase per Million reads (RPKM).



In [ ]:

    
gene_len = np.load('gene-lens.npy')
print(gene_len.shape)

Exercise: broadcast `expr_lib` and `gene_len` together to produce RPKM



In [ ]:

    
rpkm = expr_lib  # FIX THIS



In [ ]:

    
from matplotlib import pyplot as plt
from scipy import stats

def plot_col_density(data, xlim=None, *args, **kwargs):
    # Use gaussian smoothing to estimate the density
    density_per_col = [stats.kde.gaussian_kde(col) for col in data.T]
    if xlim is not None:
        m, M = xlim
    else:
        m, M = np.min(data), np.max(data)
    x = np.linspace(m, M, 100)

    plt.figure()
    for density in density_per_col:
        plt.plot(x, density(x), *args, **kwargs)
    plt.xlabel('log-counts')
    plt.ylabel('frequency')
    if xlim is not None:
        plt.xlim(xlim)
    plt.show()



In [ ]:

    
%matplotlib inline



In [ ]:

    
plt.style.use('ggplot')



In [ ]:

    
plot_col_density(np.log(expr+1))



In [ ]:

    
plot_col_density(np.log(rpkm + 1), xlim=(0, 250))

Exercise: 3D broadcasting

Below, produce the array containing the sum of every element in x with every element in y



In [ ]:

    
x = np.random.rand(3, 5)
y = np.random.randint(10, size=8)
z = x # FIX THIS

Exercise: explicit broadcasting and stride tricks

Use np.broadcast_arrays to get the same-shape arrays that numpy adds together. Then use print_info on the output. Notice anything weird?



In [ ]:

Stride tricks

By manipulating the shape and strides of an array, we can produce a "virtual" array much bigger than its memory usage:



In [ ]:

    
def repeat(arr, n):
    return np.lib.stride_tricks.as_strided(arr,
                                           shape=(n,) + arr.shape,
                                           strides=(0,) + arr.strides)



In [ ]:

    
repeat(np.random.rand(5), 4)

Exercise: `np.lib.stride_tricks.as_strided`

Use as_strided to produce a sliding-window view of a 1D array.



In [ ]:

    
def sliding_window(arr, size=2):
    """Produce an array of sliding window views of `arr`
    
    Parameters
    ----------
    arr : 1D array, shape (N,)
        The input array.
    size : int, optional
        The size of the sliding window.
        
    Returns
    -------
    arr_slide : 2D array, shape (N - size - 1, size)
        The sliding windows of size `size` of `arr`.
        
    Examples
    --------
    >>> a = np.array([0, 1, 2, 3])
    >>> sliding_window(a, 2)
    array([[0, 1],
           [1, 2],
           [2, 3]])
    """
    return arr  # fix this



In [ ]:

    
# test your code here
sliding_window(np.arange(8), 3)

Fancy indexing

You can index arrays with slicing, but also with boolean arrays (including broadcasting!), integer arrays, and individual indices along multiple dimensions.



In [ ]:

    
values = np.array([0, 5, 99])
selector = np.random.randint(0, 3, size=(3, 4))
print(selector)
print(values[selector])

Exercise: quantile normalization

Quantile Normalization(https://en.wikipedia.org/wiki/Quantile_normalization) is a method to align distributions. Implement it using NumPy axis-wise operations and fancy indexing.

Hint: look for documentation for scipy.mstats.rankdata, np.sort, and np.argsort.



In [ ]:

    
def qnorm(x):
    """Quantile normalize an input matrix.
    
    Parameters
    ----------
    x : 2D array of float, shape (M, N)
        The input data, with each column being a
        distribution to normalize.
        
    Returns
    -------
    xn : 2D array of float, shape (M, N)
        The normalized data.
    """
    xn = np.copy(x) # replace this by normalizing code
    return xn



In [ ]:

    
logexpr = np.log(expr + 1)
logrpkm = np.log(rpkm + 1)



In [ ]:

    
logexprn = qnorm(logexpr)
logrpkmn = qnorm(logrpkm)



In [ ]:

    
plot_col_density(logexprn)



In [ ]:

    
plot_col_density(logrpkmn, xlim=(0, 0.25))

Advanced exercise Jack's dilemma

(If time permits.)

Date: Wed, 16 Jul 2008 16:45:37 -0500
From: Jack Cook
To: <numpy-discussion@scipy.org>
Subject: Numpy Advanced Indexing Question

Greetings,

I have an I,J,K 3D volume of amplitude values at regularly sampled time intervals. I have an I,J 2D slice which contains a time (K) value at each I, J location. What I would like to do is extract a subvolume at a constant +/- K window around the slice. Is there an easy way to do this using advanced indexing or some other method? Thanks in advanced for your help.

-- Jack



In [ ]:

    
# "data"

ni, nj, nk = (10, 15, 20)
amplitude = np.random.rand(ni, nj, nk)
horizon = np.random.randint(5, 15, size=(ni, nj))



In [ ]:

Even more advanced: NumPy Array Interface

An author of a foreign package (included with the exercizes as problems/mutable_str.py) provides a string class that allocates its own memory:

ipython
In [1]: from mutable_str import MutableString
In [2]: s = MutableString('abcde')
In [3]: print s
abcde

You'd like to view these mutable (mutable means the ability to modify in place) strings as ndarrays, in order to manipulate the underlying memory.

Add an array_interface dictionary attribute to s, then convert s to an ndarray. Numerically add "2" to the array (use the in-place operator +=).

Then print the original string to ensure that its value was modified.

Hint: Documentation for NumPy's __array_interface__ may be found in the online docs.

Here's a skeleton outline:



In [ ]:

    
import numpy as np
from mutable_str import MutableString

s = MutableString('abcde')

# --- EDIT THIS SECTION ---

# Create an array interface to this foreign object
s.__array_interface__ = {'data' : (XXX, False), # (ptr, is read_only?)
                         'shape' : XXX,
                         'typestr' : '|u1', # typecode unsigned character
                         }

# --- EDIT THIS SECTION ---

print 'String before converting to array:', s
sa = np.asarray(s)

print 'String after converting to array:', sa

sa += 2
print 'String after adding "2" to array:', s



In [ ]: